AITopics | visual scene graph


visual scene graph



Multimodal Machine Translation with Visual Scene Graph Pruning

Lu, Chenyu, Sun, Shiliang, Zhao, Jing, Zhang, Nan, Song, Tengfei, Yang, Hao

arXiv.org Artificial Intelligence, May-27-2025

Multimodal machine translation (MMT) seeks to address the challenges posed by linguistic polysemy and ambiguity in translation tasks by incorporating visual information. A key bottleneck in current MMT research is the effective utilization of visual data. Previous approaches have focused on extracting global or region-level image features and using attention or gating mechanisms for multimodal information fusion. However, these methods have not adequately tackled the issue of visual information redundancy in MMT, nor have they proposed effective solutions. In this paper, we introduce a novel approach, multimodal machine translation with visual Scene Graph Pruning (PSG), which leverages language scene graph information to guide the pruning of redundant nodes in visual scene graphs, thereby reducing noise in downstream translation tasks. Through extensive comparative experiments with state-of-the-art methods and ablation studies, we demonstrate the effectiveness of the PSG model. Our results also highlight the promising potential of visual information pruning in advancing the field of MMT.
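
As a rough illustration of the pruning idea in this abstract, the sketch below drops visual scene graph nodes that the language scene graph never mentions. It is not the PSG model: the `SceneGraph` class, the `prune_visual_graph` function, and the exact-label matching rule are all assumptions made for this example, and the paper may well match nodes in a different way.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): node labels plus (subject, relation, object) index triples."""
    nodes: list[str]
    edges: list[tuple[int, str, int]] = field(default_factory=list)


def prune_visual_graph(visual: SceneGraph, language: SceneGraph) -> SceneGraph:
    """Keep only visual nodes whose label also appears in the language graph.

    Exact, case-insensitive label matching is an assumption of this sketch,
    not something stated in the abstract.
    """
    mentioned = {label.lower() for label in language.nodes}
    # Map each surviving old node index to its index in the pruned graph.
    keep = {}
    for old_idx, label in enumerate(visual.nodes):
        if label.lower() in mentioned:
            keep[old_idx] = len(keep)
    nodes = [visual.nodes[i] for i in keep]
    # An edge survives only if both of its endpoints survive.
    edges = [(keep[s], rel, keep[o])
             for s, rel, o in visual.edges
             if s in keep and o in keep]
    return SceneGraph(nodes, edges)


visual = SceneGraph(
    nodes=["man", "horse", "fence", "tree"],
    edges=[(0, "riding", 1), (1, "near", 2), (3, "behind", 2)],
)
language = SceneGraph(nodes=["man", "horse"], edges=[(0, "riding", 1)])
pruned = prune_visual_graph(visual, language)
print(pruned.nodes, pruned.edges)
```

In this toy input the "fence" and "tree" nodes, and every edge touching them, are removed because the sentence-side graph never refers to them.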

artificial intelligence, natural language, translation, (14 more...)


2505.19507

Genre:

Research Report > New Finding (0.66)
Research Report > Promising Solution (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


Visual Environment-Interactive Planning for Embodied Complex-Question Answering

Lan, Ning, Ou, Baoshan, Xie, Xuemei, Shi, Guangming

arXiv.org Artificial Intelligence, Apr-1-2025

This study focuses on the Embodied Complex-Question Answering task, in which an embodied robot needs to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approaches rely on large models without sufficient understanding of the environment. Considering multi-step planning, this paper proposes a framework for formulating plans in a sequential manner. To ensure that our framework can tackle complex questions, we create a structured semantic space in which hierarchical visual perception and a chain expression of the question essence can interact iteratively. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which clarifies the intention of the question. Then, we incorporate external rules to make a plan for the current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. The feasibility of our approach in real-world scenarios has also been established, indicating its practical applicability. Index Terms: Embodied complex-question answering, task planning, language parsing, structured semantic space. The development of versatile embodied agents capable of understanding natural language commands in indoor environments and executing various tasks through visual interaction has been a long-standing goal.

large language model, machine learning, question answering, (22 more...)


doi: 10.1109/TCSVT.2025.3538860

2504.00775

Country:

Asia > China > Shaanxi Province > Xi'an (0.05)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.91)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.82)
(3 more...)


SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Wang, Zhecan, You, Haoxuan, Li, Liunian Harold, Zareian, Alireza, Park, Suji, Liang, Yiqing, Chang, Kai-Wei, Chang, Shih-Fu

arXiv.org Artificial Intelligence, Dec-15-2021

Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods and prove the efficacy of each proposed component.
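
One simple way to picture hop-aware attention over a scene graph is a mask that only lets a node attend to nodes within a fixed number of hops. The sketch below builds such a mask in plain Python. It is an illustrative assumption, not the SGEITL multihop graph transformer: the function names `hop_distances` and `hop_attention_mask`, the undirected treatment of edges, and the hard cutoff at `max_hops` are all invented for this example.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs shortest hop counts on an undirected graph via BFS (None = unreachable)."""
    adj = [[] for _ in range(num_nodes)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if dist[src][nxt] is None:
                    dist[src][nxt] = dist[src][cur] + 1
                    queue.append(nxt)
    return dist


def hop_attention_mask(num_nodes, edges, max_hops):
    """mask[i][j] is True when node j lies within max_hops of node i.

    A hard hop cutoff is an assumption of this sketch; the abstract only
    says that attention is regularized among hops.
    """
    dist = hop_distances(num_nodes, edges)
    return [[d is not None and d <= max_hops for d in row] for row in dist]


# A chain 0 - 1 - 2 - 3 plus an isolated node 4.
chain_edges = [(0, 1), (1, 2), (2, 3)]
mask = hop_attention_mask(5, chain_edges, max_hops=1)
for row in mask:
    print(row)
```

Such a boolean matrix could then be used to block attention scores between distant nodes before the softmax in an attention layer; the isolated node can attend only to itself.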

graph, scene graph, visual scene graph, (12 more...)


2112.08587

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.05)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
